Static model
Appendix
$\DeclareMathOperator{\tr}{tr}$
The Fristonian free-energy is

$ F = U - S $

which comprises an energy term

$ U = \langle \ln p(y, \vartheta|m)\rangle_q = \langle L\rangle_q $

and a (negative) Shannon entropy term

$ S = \langle \ln q(\vartheta)\rangle_q $

With the mean-field approximation over the variational density, $q$,

$ q(\vartheta) = \prod_i q_i(\vartheta_i) = \prod_i q_i. $

Let us for now assume the following case

$ q(\vartheta) = q_i(\vartheta_i)\, q_j(\vartheta_j) = q_i q_j, \quad j = \setminus i $

and note that the following satisfies $\delta F / \delta q = 0$

$ q_i \propto \exp(V_i), \qquad V_i = \langle L \rangle_{q_j} $

where $V_i = V(\vartheta_i)$ is known as the variational energy and is the log joint probability of data and parameters, $L$, expected under its Markov blanket, $q_j$.
Under the Laplace assumption, let us suppose that

$ q_i = \mathcal N(\vartheta_i;\, \mu_i, \Sigma_i), \qquad q_j = \mathcal N(\vartheta_j;\, \mu_j, \Sigma_j) $

are Gaussian, where $\Sigma$ is the covariance matrix.
Following the factorisation, we write

$ F = U - H $

$ U = \langle L(\vartheta)\rangle_{q_i q_j} $

$ H = \langle \ln q_i\rangle_{q_i} + \langle \ln q_j\rangle_{q_j} $

The (neg-)entropies are (see note A.1)

$ H = -\frac{D_i}{2}(1+\ln 2\pi) - \frac12\ln|\Sigma_i| - \frac{D_j}{2}(1+\ln 2\pi) - \frac12\ln|\Sigma_j| = -\frac{D_i}{2}\ln 2\pi e - \frac12\ln|\Sigma_i| - \frac{D_j}{2}\ln 2\pi e - \frac12\ln|\Sigma_j| $

Further, with a second-order truncation (dropping the bilinear terms)

$ L(\vartheta) \approx L(\mu) + L^{(\vartheta_i)}(\vartheta_i-\mu_i) + L^{(\vartheta_j)}(\vartheta_j-\mu_j) + \frac12(\vartheta_i-\mu_i)^T L^{(\vartheta_i\vartheta_i)}(\vartheta_i-\mu_i) + \frac12(\vartheta_j-\mu_j)^T L^{(\vartheta_j\vartheta_j)}(\vartheta_j-\mu_j) $

where $L^{(\vartheta_i\vartheta_i)}$ denotes $\partial^2 L/\partial\vartheta_i\partial \vartheta_i$ evaluated at the mode $\mu$, and write

$ U = \langle L(\vartheta)\rangle_{q_i q_j} \approx L(\mu) + \frac12\tr\!\left(\Sigma_i L^{(\vartheta_i\vartheta_i)}\right) + \frac12\tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j)}\right) $

The first-order terms vanish in expectation because $\mu$ is the mean of $q$. Here, one uses the property: $a^TCa = \tr[aa^TC]$.
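As a quick numerical sanity check of this trace identity (a minimal NumPy sketch, not part of the original derivation; the vector $a$ and matrix $C$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
a = rng.standard_normal(D)          # arbitrary vector
C = rng.standard_normal((D, D))     # arbitrary square matrix

lhs = a @ C @ a                     # a^T C a
rhs = np.trace(np.outer(a, a) @ C)  # tr[a a^T C]
assert np.isclose(lhs, rhs)
```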
Thus, the free-energy becomes

$ F = L(\mu) + \frac12\left[\tr\!\left(\Sigma_i L^{(\vartheta_i\vartheta_i)}\right) + \tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j)}\right)\right] + \frac{D_i}{2}\ln 2\pi e + \frac12\ln|\Sigma_i| + \frac{D_j}{2}\ln 2\pi e + \frac12\ln|\Sigma_j| $

Solving for $\Sigma_i$ (or simply by inspection, via completing the square),

$ \begin{aligned} \delta F/\delta\Sigma_i = 0 &= \frac12 L^{(\vartheta_i\vartheta_i)T} + \frac12\Sigma_i^{-1} \\ \Sigma_i^{-1} &= -L^{(\vartheta_i\vartheta_i)T} = -L^{(\vartheta_i\vartheta_i)} \end{aligned} $
Note that the same procedure can be applied to each individual variational density; this is useful when optimising for the variational mode $\mu_i$.
To optimise the sufficient statistics of the variational density, $q$, under the Laplace approximation, we only need to find the mean and covariance of each factor, as illustrated in the preceding sections:

$ \mu_i = \arg\max_{\vartheta_i} V(\vartheta_i) $

$ \Sigma_i = -L(\mu)^{(\vartheta_i\vartheta_i)-1} $

It is worth noting that the covariance is a function of the variational mode and does not depend on the mean-field approximation.

The update scheme should read

$ V(\mu_i)^{(\vartheta_{i:k})} = L(\mu)^{(\vartheta_{i:k})} + \frac12\tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j\vartheta_{i:k})}\right) $

$ V(\mu_i)^{(\vartheta_{i:k}\vartheta_{i:l})} = L(\mu)^{(\vartheta_{i:k}\vartheta_{i:l})} + \frac12\tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j\vartheta_{i:k}\vartheta_{i:l})}\right) $

$ \Delta\mu_i = \left(\exp(tJ) - I\right)J^{-1}\, V(\mu_i)^{(\vartheta_i)} \approx -V(\mu_i)^{(\vartheta_i\vartheta_i)-1}\, V(\mu_i)^{(\vartheta_i)}, \quad \text{if } t \to \infty $

where the superscript $(\vartheta_{i:k})$ denotes the partial derivative with respect to the $k$-th element of $\vartheta_i$, and $J = V(\mu_i)^{(\vartheta_i\vartheta_i)}$. For a detailed derivation, see the linearisation note below.
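To make the update concrete, here is a minimal numerical sketch of a single mode update in plain NumPy/SciPy, with an arbitrary illustrative gradient g and negative-definite curvature J standing in for $V(\mu_i)^{(\vartheta_i)}$ and $V(\mu_i)^{(\vartheta_i\vartheta_i)}$; the Theano code in the examples below implements the same recipe symbolically:

```python
import numpy as np
from scipy.linalg import expm, inv

# illustrative gradient and (negative-definite) curvature of the variational energy
g = np.array([0.4, -0.2])                  # stands in for V(mu_i)^(theta_i)
J = np.array([[-2.0, 0.3], [0.3, -1.5]])   # stands in for V(mu_i)^(theta_i theta_i)

t = 4.0                                    # pseudo-time of the local linearisation
dmu_ll = (expm(t * J) - np.eye(2)) @ inv(J) @ g   # (exp(tJ) - I) J^{-1} V^(theta_i)
dmu_newton = -inv(J) @ g                          # the t -> infinity (Newton) limit

print(dmu_ll, dmu_newton)   # nearly identical for large t
```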
Suppose one has parameters that factorise into $\vartheta = \{\theta, \varphi\}$
and let $\theta=(\theta_1, \theta_2)^T$ and $\varphi = (\varphi_1, \varphi_2)^T$
Unpack $L^{(\theta\theta\varphi_1\varphi_1)}$ in the second term
It is straightforward to show that, for instance, $\tr\left[\Sigma_\theta L^{(\theta\theta\varphi_1\varphi_2)}\right]$ is equivalent to $\Sigma^T\!\!\!:_\theta^T L\!\!:^{(\theta\theta)(\varphi_1\varphi_2)}$, where $\Sigma\!\!:$ denotes matrix vectorisation, and $\Sigma^T\!\!\!:$ stands for vectorisation along columns.
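This equivalence is easy to verify numerically; a minimal NumPy sketch (the symmetric $\Sigma_\theta$ and the matrix standing in for $L^{(\theta\theta\varphi_1\varphi_2)}$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))
Sigma_theta = A @ A.T                    # an arbitrary symmetric covariance
L_ttp1p2 = rng.standard_normal((2, 2))   # stands in for L^(theta theta phi_1 phi_2)

lhs = np.trace(Sigma_theta @ L_ttp1p2)                  # tr[Sigma_theta L^(..)]
rhs = Sigma_theta.T.flatten() @ L_ttp1p2.flatten()      # vec(Sigma^T)^T vec(L^(..))
# consistent flattening on both sides gives the same inner product
assert np.isclose(lhs, rhs)
```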
Thus, $L\!\!:^{(\theta\theta)(\varphi\varphi)}$ becomes a matrix that reads
and one accordingly has
The same procedure can be repeated for $V^{(\theta\theta)}, V^{(\varphi)}$ and $V^{(\theta)}$ to give
In this example we will attempt to optimise the following objective function using the updating scheme above.
$ L(u, v) = -\frac{1}{16} (u - \frac{1}{8}v^2)^2 - \frac{1}{128} (v^2 - 2)^2 $
In [1]:
import numpy as np
import scipy.stats as stat
import matplotlib.pyplot as plt
from IPython import display
import time
%matplotlib inline
In [2]:
import theano
import theano.tensor as T
print("Theano version: ", theano.__version__)
In [3]:
c0 = [1. / 16, 1. / 8, 1. / 128]
u, v = T.vectors(2)
Su, Sv = T.matrices(2) # this is not necessary, only for the purpose of tutorial
L = (-c0[0] * (u - c0[1] * v ** 2) ** 2 - c0[2] * (v ** 2 - 2) ** 2)[0]
In [4]:
fSu, fSv = Su.flatten(1), Sv.flatten(1)
Lu = T.jacobian(L, u)
Lv = T.jacobian(L, v)
Luu = T.hessian(L, u).flatten(1)
Lvv = T.hessian(L, v).flatten(1)
Luuv = T.join(0, [T.jacobian(Luu[i], v) for i in range(1)])
Lvvu = T.join(0, [T.jacobian(Lvv[i], u) for i in range(1)])
Luuvv = T.join(0, [T.hessian(Luu[i], v).flatten(1) for i in range(1)])
Lvvuu = T.join(0, [T.hessian(Lvv[i], u).flatten(1) for i in range(1)])
Uu = T.join(0, [T.sum(fSv * Lvvu[:, i]) for i in range(1)])
Uv = T.join(0, [T.sum(fSu * Luuv[:, i]) for i in range(1)])
Uuu = T.join(0, [T.sum(fSv * Lvvuu[:, i]) for i in range(1)])
Uvv = T.join(0, [T.sum(fSu * Luuvv[:, i]) for i in range(1)])
Vu = Lu + 0.5 * Uu    # V^(u): gradient of the variational energy
Vv = Lv + 0.5 * Uv
Vuu = (Luu + 0.5 * Uuu).reshape((1, 1))    # V^(uu): curvature of the variational energy
Vvv = (Lvv + 0.5 * Uvv).reshape((1, 1))
# free-energy F = L(mu) + 1/2 [tr(Sigma_i L^(ii)) + ln|Sigma_i|] (additive constants dropped)
F = L + 0.5 * (Su.dot(Luu) + Sv.dot(Lvv)) + 0.5 * T.log(T.nlinalg.det(Su)) + 0.5 * T.log(T.nlinalg.det(Sv))
du = -(T.nlinalg.pinv(Vuu).dot(Vu))    # Newton step (t -> infinity limit)
dv = -(T.nlinalg.pinv(Vvv).dot(Vv))
# alternatively, the local-linearisation step with t = 1 (1-D case)
# du = (T.exp(Vuu) - 1).dot(T.nlinalg.pinv(Vuu).dot(Vu))
# dv = (T.exp(Vvv) - 1).dot(T.nlinalg.pinv(Vvv).dot(Vv))
Cu = - T.nlinalg.pinv(Luu.reshape((1, 1)))    # conditional covariance Sigma_u = -L^(uu)^-1
Cv = - T.nlinalg.pinv(Lvv.reshape((1, 1)))
In [5]:
update = theano.function([u, v, Su, Sv], [du, dv, Cu, Cv], allow_input_downcast=True)
elb = theano.function([u, v, Su, Sv], F, allow_input_downcast=True)
In [6]:
tu, tv = 1., 1. # delta_u, delta_v
u0, v0, Su0, Sv0 = np.array([1.]), np.array([1.]), np.array([[1.]]), np.array([[1.]])
tCu, tCv = Su0, Sv0
tol = 0.001
In [7]:
plt.figure()
plt.hold(True)
step = 0
for maxM in range(256):
    plt.subplot(211)
    plt.plot(v0, u0, 'r*', alpha=1)
    for maxN in range(256):
        tu, tv, tCu, tCv = update(u0, v0, Su0, Sv0)
        u0 += tu
        v0 += tv
        plt.subplot(211)
        plt.plot(v0, u0, 'b.', alpha=.5);
        plt.subplot(212)
        plt.plot(step, elb(u0, v0, Su0, Sv0), 'r+');
        display.display(plt.gcf())
        display.clear_output(wait=True)
        time.sleep(0.)
        step += 1
        if (tu ** 2) < tol and (tv ** 2) < tol: break
    if np.sum((Su0 - tCu) ** 2) < tol and np.sum((Sv0 - tCv) ** 2) < tol: break
    Su0, Sv0 = tCu, tCv
print("Converged.")
print("u: ", u0, "v: ", v0, "Cu: ", Su0, "Cv: ", Sv0)
Instead of writing multiple lines of code each time, in this example we create a function vilaps() which allows us to reuse the updating scheme for different models/objective functions. This function uses Theano to generate a compiled updating scheme, with optional observation y and hyperprior hp arguments.
In [8]:
import scipy.stats as stat
import numpy as np
import theano
import functools
import theano.tensor as T
print("Theano version: ", theano.__version__)
In [9]:
def vilaps(L, b, bdims, y=None, hp=None):
    # wrappers that tolerate derivatives of terms not depending on a given input
    def _jac_(*args, **kwargs):
        return T.jacobian(*args, disconnected_inputs='warn', **kwargs)

    def _hes_(*args, **kwargs):
        return T.hessian(*args, disconnected_inputs='warn', **kwargs)

    assert len(b) == len(bdims)
    if y is None:
        y = []
    else:
        y = [y]
    if hp is None:
        hp = []
    mfp = len(bdims)  # number of mean-field partitions
    # first- and second-order derivatives of L w.r.t. each partition
    Li = [_jac_(L, b[i]) for i in range(mfp)]
    Lii = [_hes_(L, b[i]).flatten(1) for i in range(mfp)]
    # third- and fourth-order derivatives entering the mean-field terms
    Liij = [[T.join(0, [_jac_(Lii[i][k], b[j]) for k in range(bdims[i] ** 2)])
             for j in range(mfp)]
            for i in range(mfp)]
    Liijj = [[T.join(0, [_hes_(Lii[i][k], b[j]).flatten(1) for k in range(bdims[i] ** 2)])
              for j in range(mfp)]
             for i in range(mfp)]
    # conditional covariances Sigma_i = -L^(ii)^-1
    Ci = [- T.nlinalg.pinv(Lii[i].reshape((bdims[i], bdims[i]))) for i in range(mfp)]
    fCi = [Ci[i].flatten(1) for i in range(mfp)]
    # mean-field corrections tr(Sigma_j L^(jj i)) and tr(Sigma_j L^(jj ii))
    Uj = [[T.join(0, [T.sum(fCi[j] * Liij[j][i][:, k]) for k in range(bdims[i])])
           for j in range(mfp)
           if j != i]
          for i in range(mfp)]
    Ujj = [[T.join(0, [T.sum(fCi[j] * Liijj[j][i][:, k]) for k in range(bdims[i] ** 2)])
            for j in range(mfp)
            if j != i]
           for i in range(mfp)]
    # gradient and curvature of the variational energy V
    Vi = [Li[i] + 0.5 * functools.reduce(lambda x, y: x + y, Uj[i]) for i in range(mfp)]
    Vii = [(Lii[i] + 0.5 * functools.reduce(lambda x, y: x + y, Ujj[i])).reshape((bdims[i], bdims[i]))
           for i in range(mfp)]
    # Gauss-Newton step and local-linearisation step (t = 1; elementwise exp
    # coincides with the matrix exponential only for 1-D partitions)
    bGN = [-T.nlinalg.pinv(Vii[i]).dot(Vi[i]) for i in range(mfp)]
    bLM = [(T.exp(Vii[i]) - 1).dot(T.nlinalg.pinv(Vii[i])).dot(Vi[i]) for i in range(mfp)]
    # free-energy (additive constants dropped)
    F = L
    for i in range(mfp):
        F += 0.5 * (T.sum(fCi[i] * Lii[i]) + T.log(T.nlinalg.det(Ci[i])))
    viupdate = theano.function(
        b + hp + y,
        bGN + bLM + Ci + [F],
        allow_input_downcast=True)
    return viupdate
This time we try to invert the following model, which generates our observation $y$:

$ p(y|\mu,\tau) = \mathcal N(y\,|\,\mu, \tau^{-1}), \qquad p(\mu|\tau) = \mathcal N(\mu\,|\,\mu_0, (\lambda_0\tau)^{-1}), \qquad p(\tau) = \mathrm{Ga}(\tau\,|\,a_0, b_0) $

where we assume the mean-field approximation

$ q(\mu,\tau) = q(\mu)\,q(\tau) $

and the Laplace assumption.
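Since this Normal-Gamma model is conjugate, its exact posterior is available in closed form and can serve as a reference for the variational result. The sketch below assumes the standard conjugate update equations; the helper name exact_normal_gamma_posterior and the call using Y, hu, hv (defined in the next cells) are illustrative and not part of the original notebook:

```python
import numpy as np

def exact_normal_gamma_posterior(y, mu0, lam0, a0, scale0):
    # standard conjugate Normal-Gamma update (rate = 1 / scale)
    y = np.asarray(y).ravel()
    n, ybar = y.size, y.mean()
    mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
    lam_n = lam0 + n
    a_n = a0 + n / 2.
    rate_n = 1. / scale0 + 0.5 * (np.sum((y - ybar) ** 2)
                                  + lam0 * n * (ybar - mu0) ** 2 / (lam0 + n))
    return mu_n, lam_n, a_n, 1. / rate_n   # mu_n, lam_n, a_n, scale_n

# e.g. mu_n, lam_n, a_n, scale_n = exact_normal_gamma_posterior(Y, hu[0], hu[1], hv[0], hv[1])
# the exact posterior mean of tau is then a_n * scale_n
```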
In [10]:
hv = [9., 3.]
hu = [5., 2.]
rngv = stat.gamma(a=hv[0], scale=hv[1])
N = 1000
Y = np.zeros((N, 1))
U = np.zeros(N)
V = np.zeros(N)
for s in range(N):
    v = rngv.rvs()
    rngu = stat.norm(loc=hu[0], scale=np.sqrt(1 / (hu[1] * v)))
    u = rngu.rvs()
    Y[s, :] = stat.norm(loc=u, scale=np.sqrt(1 / v)).rvs()  # std = sqrt(1 / precision)
    U[s] = u
    V[s] = v
In [11]:
y = T.matrix()
u, v, pu, pv = T.vectors(4)
L1 = T.sum(-0.5 * (y - u).dot(T.nlinalg.matrix_inverse(T.diag(v))) * (y - u)) -\
0.5 * ((u - pu[0]).dot(T.nlinalg.matrix_inverse(T.diag(pu[1] * v)))).dot(u - pu[0]) -\
T.gammaln(pv[0]) * pv[1] ** pv[0] + (pv[0] - 1) * T.log(v) - v / pv[1]
L1 = T.sum(L1)
In [12]:
b = [u, v]
hp = [pu, pv]
bdims = [1, 1]
# The line below generates a warning (or raises a DisconnectedInput exception if the disconnected_inputs
# flag is not otherwise set for T.jacobian and T.hessian). This happens when AD takes the derivative of a constant.
viu = vilaps(L1, b, bdims, y, hp);
In [13]:
u0, v0 = np.array([2.]), np.array([5.])
tol = 0.0001
plt.figure()
for step in range(256):
    gnu, gnv, lmu, lmv, Cu, Cv, FE = viu(u0, v0, hu, hv, Y)
    tu, tv = gnu, gnv
    u0 += tu
    v0 += tv
    plt.subplot(211)
    plt.plot(u0, v0, 'bo', alpha=.5)
    plt.subplot(212)
    plt.plot(step, FE, 'r+')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    if np.square(tu) < tol and np.square(tv) < tol: break
In [14]:
plt.figure(figsize=(16, 5))
plt.subplot(121)
plt.plot(U, V, 'b.', alpha=.1);
plt.plot(hu[0], np.prod(hv), 'r*', markersize=12, alpha=.7);
plt.plot(u0, v0, 'y*', markersize=12, alpha=.7);
plt.grid(True)
plt.subplot(122)
plt.plot(U, V, 'b.', alpha=.1);
plt.plot(hu[0], np.prod(hv), 'r*', markersize=12, alpha=.7);
plt.plot(u0, v0, 'y*', markersize=12, alpha=.7);
plt.grid(True)
plt.xlim([4.8, 5.2])
plt.ylim([20, 40])
plt.legend(["sample", "true", "approx"]);
Out[14]: [figure: sampled (u, v) pairs, the true value and the variational approximation; the right panel shows a zoomed view]
This example repeats Example 1 using the function we just wrote.
In [15]:
c0 = [1. / 16, 1. / 8, 1. / 128]
u, v = T.vectors(2)
L2 = (-c0[0] * (u - c0[1] * v ** 2) ** 2 - c0[2] * (v ** 2 - 2) ** 2)[0]
viu = vilaps(L2, [u, v], [1, 1], hp=None, y=None)
In [21]:
plt.figure()
u0 = np.array([1.])
v0 = np.array([1.])
tol = 0.0001
for step in range(256):
    gnu, gnv, lmu, lmv, Cu, Cv, FE = viu(u0, v0)
    tu, tv = gnu, gnv
    u0 += tu
    v0 += tv
    plt.subplot(211)
    plt.plot(u0, v0, 'bo', alpha=.5)
    plt.subplot(212)
    plt.plot(step, FE, 'r+')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    if np.all(np.square([tu, tv]) < tol): break
print(u0, v0)
Note A.1. Say $\pmb x \sim N(\pmb\mu, \pmb\Lambda^{-1})$
Then,
$ \ln p(\pmb x) = -\frac D 2\ln 2\pi - \frac 1 2 \ln|\pmb\Lambda^{-1}| - \frac 1 2 (\pmb x-\pmb\mu)^T\pmb\Lambda(\pmb x-\pmb\mu) $
$ H(\pmb x) = \langle -\ln p(\pmb x)\rangle = \frac D 2\ln 2\pi e + \frac 1 2 \ln |\pmb\Lambda^{-1}| $
$ D = \dim(\pmb x) $
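A quick numerical check of this entropy formula (a minimal sketch using SciPy's multivariate normal; the covariance below is arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
A = np.random.default_rng(2).standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)          # an arbitrary SPD covariance (Lambda^{-1})

H_formula = 0.5 * D * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(Sigma))
H_scipy = multivariate_normal(mean=np.zeros(D), cov=Sigma).entropy()
assert np.isclose(H_formula, H_scipy)
```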
The following derives a local linearisation of a dynamical system, proposed by Ozaki (1992).
Suppose we have a stochastic dynamical system with continuous-time Gaussian white noise $n(t)$,

$ \dot y(t) = f(y(t)) + n(t) \quad (1) $

First, we approximate the equation by a Gaussian process,

$ \dot y(t) = K_t\, y(t) + n(t) \quad (2) $

with some appropriate $K_t$ on a small interval $[t, t+\Delta t)$.

From the assumption that the Jacobian of the linear part of $(2)$ over each interval $[t, t+\Delta t)$ is given by the Jacobian $J_t = \frac{\partial f(y)}{\partial y}$ at $y = y_t$ of the original dynamical system, we have the following equation by the chain rule,

$ \ddot y(s) = \frac{d\dot y(s)}{ds} = \frac{\partial f(y)}{\partial y}\frac{\partial y}{\partial s} = J_t\, \dot y(s) \quad (3) $

for $t \leq s \leq t + \Delta t$.

By integrating $(3)$ on $[t, t + \tau)$, $0 \leq \tau < \Delta t$, we have

$ \dot y(t+\tau) = \exp(J_t\tau)\,\dot y(t) = \exp(J_t\tau)\, f(y(t)) \quad (4) $

Integrating equation $(4)$ again with respect to $\tau$ on $[0, \Delta t)$, using integration by parts, we finally get the following equation, which gives the value of $y(t+\Delta t)$ as a function of $y(t)$:

$ y(t+\Delta t) = y(t) + J_t^{-1}\left\{\exp(J_t\Delta t) - I\right\} f(y(t)) \quad (5) $
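As a minimal numerical sketch of equation (5) (plain NumPy/SciPy; the flow f, its Jacobian and the step size below are arbitrary illustrations, not part of the original note), one local-linearisation step can be compared against many small Euler steps:

```python
import numpy as np
from scipy.linalg import expm, inv

def f(y):
    # an illustrative nonlinear flow
    return np.array([-y[0] + 0.5 * y[1] ** 2, -2.0 * y[1]])

def jac(y):
    # Jacobian df/dy of the flow above
    return np.array([[-1.0, y[1]], [0.0, -2.0]])

y0, dt = np.array([1.0, 1.0]), 0.1

# one local-linearisation step: y + J^{-1} {exp(J dt) - I} f(y)
J = jac(y0)
y_ll = y0 + inv(J) @ (expm(J * dt) - np.eye(2)) @ f(y0)

# reference: many small explicit Euler steps over the same interval
y_euler = y0.copy()
for _ in range(1000):
    y_euler = y_euler + (dt / 1000) * f(y_euler)

print(y_ll, y_euler)   # the two should agree closely
```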